org_df <- read_excel("wuhan_blood_sample_data_Jan_Feb_2020.xlsx")

Dataset cleaning

The following steps were taken to clean the original dataset:
  • replacing the gender values, from numeric (1, 2) to factor (male, female),
  • replacing the outcome values, from numeric (0, 1) to factor (Survived, Died),
  • unifying the column name ‘Admission time’ to admission_time,
  • unifying the column name ‘Discharge time’ to discharge_time,
  • replacing NA values in PATIENT_ID with a suitable id (done in the next chapter),
  • removing records with no biomarker values at all (rows where RE_DATE is missing),
  • renaming biomarker columns.
df <- org_df %>% 
        mutate(gender = as.factor(ifelse(gender==1, "male", "female"))) %>%
        mutate(outcome = as.factor(ifelse(outcome == 0, "Survived", "Died"))) %>%
        filter(!is.na(RE_DATE)) %>% 
        rename(admission_time = 'Admission time',
               discharge_time = 'Discharge time',
               hs_CRP = 'High sensitivity C-reactive protein')

names(df)[34] <- "Tumor necrosis factor alpha"
names(df)[37] <- "Interleukin 1 beta"
names(df)[68] <- "Gamma glutamyl transpeptidase"
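
The recoding step can be illustrated on toy data. This is a minimal base R sketch, not the report's own code: the vector gender_code and the unexpected code 3 are hypothetical. It shows that ifelse() maps every value other than 1 to "female", while factor() with explicit levels turns unknown codes into NA.

```r
# Toy codes; the value 3 stands in for an unexpected or miscoded entry (hypothetical).
gender_code <- c(1, 2, 2, 1, NA, 3)

# ifelse() sends everything that is not 1 (including 3) to "female".
via_ifelse <- as.factor(ifelse(gender_code == 1, "male", "female"))

# factor() with explicit levels turns codes outside 1 and 2 into NA instead.
via_factor <- factor(gender_code, levels = c(1, 2), labels = c("male", "female"))

print(table(via_ifelse, useNA = "ifany"))
print(table(via_factor, useNA = "ifany"))
```

If the source column is known to contain only 1 and 2, both variants give the same result; the explicit-levels form simply makes a miscoded value visible.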

Dataset summary

The dataset consists of 81 variables and has 6106 observations (blood tests). See the summary below:

summary_df <- df %>% select(outcome, gender)
tbl_summary(
  summary_df,
  by = outcome,
  label = gender ~ "Gender") %>% 
  modify_header(label ~ "**Variable**") %>% 
  add_overall() %>%
  as_kable() %>%   kable_paper("hover")
Variable   Overall, N = 6,106   Died, N = 2,897   Survived, N = 3,209
Gender
  female   2,388 (39%)          749 (26%)         1,639 (51%)
  male     3,718 (61%)          2,148 (74%)       1,570 (49%)

The blood tests were taken from 361 different patients.

df %>% select(PATIENT_ID, gender, outcome) %>% 
  drop_na(PATIENT_ID) %>% 
  select(-PATIENT_ID) %>%  
  tbl_summary(label =  gender ~ "Gender", by = outcome) %>%
  add_overall() %>% 
  modify_header(label ~ "**Variable**") %>%
  as_kable() %>%   kable_paper("hover")
Variable   Overall, N = 361   Died, N = 166   Survived, N = 195
Gender
  female   149 (41%)          46 (28%)        103 (53%)
  male     212 (59%)          120 (72%)       92 (47%)

From the cleaned dataset, two datasets are created, Patients and Blood tests, each containing specific values, to make the data analysis easier.

Patients

One additional column was created to store the hospitalization time of each patient; it is used in further analysis to check the relation between length of stay and outcome. Go to section Patients Visualization to see basic visualizations about patients.

patients <- df %>% 
              select(PATIENT_ID, age, gender, admission_time, discharge_time, outcome) %>% 
              drop_na(PATIENT_ID) %>%
              mutate("hospitalization_length" = round((difftime(discharge_time, admission_time, units = "days") ), digits = 2)) %>%
              relocate(hospitalization_length, .after = discharge_time)

head(patients) %>%
  kbl() %>%
  kable_paper("hover")
PATIENT_ID  age  gender  admission_time       discharge_time       hospitalization_length  outcome
1           73   male    2020-01-30 22:12:47  2020-02-17 12:40:09  17.60 days              Survived
2           61   male    2020-02-04 21:39:03  2020-02-19 12:59:01  14.64 days              Survived
3           70   female  2020-01-23 10:59:36  2020-02-08 17:52:31  16.29 days              Survived
4           74   male    2020-01-31 23:03:59  2020-02-18 12:59:12  17.58 days              Survived
5           29   female  2020-02-01 20:59:54  2020-02-18 10:33:06  16.56 days              Survived
6           81   female  2020-01-24 10:47:10  2020-02-07 09:06:58  13.93 days              Survived
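
The hospitalization length can be checked by hand for the first patient in the table. The sketch below uses base R only; the time zone "UTC" is an assumption made so that the example is reproducible, and the variable names are illustrative.

```r
# Admission and discharge timestamps of patient 1, copied from the table above.
# tz = "UTC" is an assumption for reproducibility.
admission <- as.POSIXct("2020-01-30 22:12:47", tz = "UTC")
discharge <- as.POSIXct("2020-02-17 12:40:09", tz = "UTC")

# difftime() with units = "days" returns a fractional number of days.
stay <- round(difftime(discharge, admission, units = "days"), digits = 2)
print(stay)

# as.numeric() drops the "days" unit when a plain number is needed for plotting.
stay_days <- as.numeric(stay)
```

Fixing units = "days" matters: without it, difftime() picks the unit automatically, so short and long stays could end up in different units.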

Blood tests

blood_tests_df <- df %>%
  select(-c(admission_time, discharge_time)) %>%
  fill(PATIENT_ID)
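
The fill(PATIENT_ID) call carries the last seen id down to the rows that follow it. A base R sketch of the same idea is shown below, assuming that the id is recorded only on the first row of each patient's block; fill_down is a hypothetical helper, not part of tidyr.

```r
# Hypothetical helper: forward-fill missing values with the last non-missing one.
fill_down <- function(x) {
  idx <- cumsum(!is.na(x))          # how many non-missing values seen so far
  c(NA, x[!is.na(x)])[idx + 1]      # position 1 is NA for leading gaps
}

patient_id <- c(1, NA, NA, 2, NA)
filled <- fill_down(patient_id)
print(filled)
```

This only gives correct ids when the rows are still in their original order, so any sorting or filtering by patient should happen after the fill.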

markers_df <- blood_tests_df %>% select (-c(PATIENT_ID, age, RE_DATE, gender))

  tbl_summary(
    markers_df,
    by = outcome,
    missing = "no") %>% 
    modify_header(label = "**Marker**") %>%
    add_n() %>%
    bold_labels() %>%
    as_kable() %>%  
    kable_paper("hover") %>% 
    scroll_box(width = "100%", height = "200px")
Marker N Died, N = 2,897 Survived, N = 3,209
Hypersensitive cardiac troponinI 507 70 (18, 631) 3 (2, 7)
hemoglobin 975 123 (110, 135) 127 (116, 138)
Serum chloride 975 104 (100, 111) 101 (99, 103)
Prothrombin time 662 16.3 (15.0, 18.2) 13.6 (13.1, 14.1)
procalcitonin 459 0.38 (0.14, 1.13) 0.04 (0.02, 0.06)
eosinophils(%) 957 0.00 (0.00, 0.10) 0.70 (0.00, 1.80)
Interleukin 2 receptor 268 1,180 (807, 1,603) 529 (400, 742)
Alkaline phosphatase 930 83 (64, 123) 60 (50, 75)
albumin 934 28 (24, 31) 36 (34, 39)
basophil(%) 957 0.10 (0.10, 0.20) 0.20 (0.10, 0.40)
Interleukin 10 267 11 (6, 17) 5 (5, 8)
Total bilirubin 930 14 (10, 25) 8 (6, 12)
Platelet count 957 112 (55, 174) 229 (176, 290)
monocytes(%) 958 3.0 (2.0, 4.7) 8.2 (6.3, 10.0)
antithrombin 330 80 (70, 92) 93 (86, 103)
Interleukin 8 268 30 (18, 61) 11 (7, 19)
indirect bilirubin 906 6.2 (4.2, 9.2) 4.9 (3.4, 7.1)
Red blood cell distribution width 923 13.20 (12.40, 14.40) 12.20 (11.80, 12.80)
neutrophils(%) 957 92 (88, 95) 66 (56, 76)
total protein 931 62 (57, 68) 68 (65, 72)
Quantification of Treponema pallidum antibodies 279 0.06 (0.04, 0.07) 0.05 (0.04, 0.07)
Prothrombin activity 659 66 (56, 78) 94 (88, 103)
HBsAg 279 0.01 (0.00, 0.02) 0.00 (0.00, 0.01)
mean corpuscular volume 957 91.3 (87.1, 96.4) 89.8 (86.8, 91.9)
hematocrit 957 35.9 (32.5, 39.8) 37.1 (34.3, 39.9)
White blood cell count 1,127 12 (8, 17) 6 (4, 8)
Tumor necrosis factor alpha 268 11 (8, 17) 8 (6, 10)
mean corpuscular hemoglobin concentration 957 342 (331, 350) 343 (335, 350)
fibrinogen 566 3.92 (2.44, 5.63) 4.40 (3.56, 5.34)
Interleukin 1 beta 268 5.0 (5.0, 5.0) 5.0 (5.0, 5.0)
Urea 936 11 (7, 17) 4 (3, 5)
lymphocyte count 957 0.46 (0.31, 0.69) 1.25 (0.87, 1.62)
PH value 384 6.50 (6.00, 7.41) 6.50 (6.00, 7.00)
Red blood cell count 1,127 4.0 (3.6, 4.6) 4.2 (3.8, 4.7)
Eosinophil count 957 0.00 (0.00, 0.01) 0.03 (0.00, 0.09)
Corrected calcium 914 2.35 (2.27, 2.44) 2.37 (2.27, 2.44)
Serum potassium 980 4.60 (4.04, 5.27) 4.28 (3.92, 4.62)
glucose 775 9.1 (6.9, 13.3) 5.7 (5.0, 7.6)
neutrophils count 957 10.8 (7.0, 15.2) 3.5 (2.4, 5.2)
Direct bilirubin 930 8 (5, 14) 4 (2, 5)
Mean platelet volume 862 11.30 (10.70, 12.20) 10.40 (9.90, 11.00)
ferritin 283 1,636 (928, 2,517) 504 (235, 834)
RBC distribution width SD 923 43.7 (39.9, 48.5) 39.5 (37.6, 41.4)
Thrombin time 566 17.30 (15.80, 19.75) 16.40 (15.60, 17.30)
(%)lymphocyte 958 4 (2, 7) 24 (16, 33)
HCV antibody quantification 279 0.07 (0.04, 0.11) 0.06 (0.04, 0.08)
D-D dimer 630 19 (3, 21) 1 (0, 1)
Total cholesterol 931 3.32 (2.72, 3.88) 3.93 (3.39, 4.48)
aspartate aminotransferase 935 38 (25, 59) 21 (17, 29)
Uric acid 934 245 (166, 374) 240 (193, 304)
HCO3- 934 21.8 (18.8, 24.7) 24.7 (22.8, 26.7)
calcium 979 2.00 (1.90, 2.08) 2.17 (2.10, 2.25)
Amino-terminal brain natriuretic peptide precursor(NT-proBNP) 475 1,467 (516, 4,578) 64 (23, 166)
Lactate dehydrogenase 934 593 (431, 840) 220 (189, 278)
platelet large cell ratio 862 35 (30, 42) 28 (23, 33)
Interleukin 6 272 66 (30, 142) 8 (2, 21)
Fibrin degradation products 330 114 (18, 150) 4 (4, 4)
monocytes count 957 0.36 (0.20, 0.58) 0.43 (0.32, 0.58)
PLT distribution width 862 13.60 (12.10, 15.93) 11.70 (10.70, 13.00)
globulin 930 34.1 (30.2, 38.2) 31.8 (29.5, 35.2)
Gamma glutamyl transpeptidase 930 42 (27, 79) 29 (19, 46)
International standard ratio 659 1.31 (1.17, 1.48) 1.04 (0.99, 1.09)
basophil count(#) 957 0.010 (0.010, 0.030) 0.010 (0.010, 0.020)
2019-nCoV nucleic acid detection 501
-1 57 (100%) 444 (100%)
mean corpuscular hemoglobin 957 31.20 (29.90, 32.70) 30.70 (29.60, 31.90)
Activation of partial thromboplastin time 568 40 (36, 45) 39 (35, 43)
hs_CRP 737 114 (65, 191) 7 (2, 35)
HIV antibody quantification 278 0.08 (0.07, 0.11) 0.09 (0.08, 0.11)
serum sodium 975 142 (138, 148) 140 (138, 141)
thrombocytocrit 862 0.15 (0.10, 0.21) 0.24 (0.19, 0.30)
ESR 383 36 (16, 59) 26 (13, 40)
glutamic-pyruvic transaminase 931 26 (18, 44) 21 (15, 36)
eGFR 936 72 (43, 91) 100 (85, 114)
creatinine 936 88 (68, 130) 64 (54, 83)

The blood tests are prepared for further analysis. For each patient there were many blood samples, containing many missing values. All samples of a patient have been combined into a single row that holds, for every biomarker, the last available value (the one closest to discharge).

last_sample_df <- blood_tests_df %>% 
  select(-RE_DATE) %>%
  group_by(PATIENT_ID) %>% 
  summarise(across(everything(), function(x) last(na.omit(x)))) %>%
  select(-PATIENT_ID)
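
The reduction to one row per patient can be sketched on toy data in base R. The data frame tests and the helper last_non_na are hypothetical, and the sketch assumes that rows are already in chronological order within each patient, so that "last" means "closest to discharge".

```r
# Hypothetical helper: last non-missing value, or NA when every value is missing.
last_non_na <- function(x) {
  kept <- x[!is.na(x)]
  if (length(kept) == 0) x[NA_integer_] else kept[length(kept)]
}

# Toy blood tests: several sparse rows per patient.
tests <- data.frame(
  PATIENT_ID = c(1, 1, 1, 2, 2),
  albumin    = c(30, NA, 35, NA, 28),
  glucose    = c(NA, 6.1, NA, NA, NA)
)

# Split by patient, collapse every column to its last non-missing value.
last_sample <- do.call(rbind, lapply(split(tests, tests$PATIENT_ID), function(d) {
  as.data.frame(lapply(d, last_non_na))
}))
print(last_sample)
```

A marker that was never measured for a patient stays NA, which is why missing values still have to be handled before classification.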

The combined blood samples dataset was also preprocessed for classification. Go to section Classification - dataset cleaning to see how it was cleaned. Columns and patients with too many missing values were deleted from the dataset.

# %>% na_mean(option = "median")
class_df <- last_sample_df 

Visualization

Patients gender

ggplot(patients, aes(x = gender, fill = gender)) +
  geom_bar() + 
  labs(y = "Number of patients", 
       x = "Gender") +
  theme(legend.position = "none")

Patients grouped by age and gender

patients_hist <- ggplot(patients, aes(x = age, fill = gender)) +
  # Count patients at each age value (one bar per age)
  geom_bar() +
  labs(y = "Number of patients", 
       x = "Age") +
  scale_x_continuous(breaks=seq(20, 100, 5))
 
ggplotly(patients_hist)      

Outcome grouped by gender, age

layout_ggplotly <- function(gg, x = -0.05, y = -0.05){
  # Elements 1 and 2 of the annotations list hold the options for the x and y axis labels respectively;
  # a plotly object keeps its layout under the 'x' element
  gg[['x']][['layout']][['annotations']][[1]][['y']] <- x
  gg[['x']][['layout']][['annotations']][[2]][['x']] <- y
  gg
}

patients_outcome <- ggplot(patients, aes(x = age, fill = outcome)) + 
                    geom_histogram(binwidth = 1.2) +
                    facet_grid(~ gender) +
                    scale_y_continuous(breaks=seq(0, 20, 2)) +
                    scale_x_continuous(breaks=seq(20, 100, 5)) +
                    labs(y = "Number of patients", x = "Age")

ggplotly(patients_outcome)

Outcome by hospitalization length, grouped by gender

hospitalization_length_plot <- ggplot(patients, aes(x = hospitalization_length, fill = outcome)) + 
      geom_histogram(binwidth = 1.2) +
      facet_grid(outcome ~ gender) +
      scale_y_continuous(breaks=seq(0, 20, 2)) +
      scale_x_continuous(breaks=seq(0, 40, 5)) +
      labs(y = "Number of patients", 
           x = "Hospitalization length [days]")

ggplotly(hospitalization_length_plot)

Deaths on specific days

outcome_per_day <- patients %>% 
                    mutate(discharge_time = as.Date(discharge_time)) %>%
                    filter(outcome == "Died")
                  
outcome_per_day_plot <- ggplot(outcome_per_day, aes(x = discharge_time, fill = outcome)) + 
  geom_histogram(binwidth = 1.2) + 
  facet_grid(~ gender) +  
  labs(x = "Discharge date", y = "Number of deaths") + 
  theme(legend.position = "none")

ggplotly(outcome_per_day_plot)

Deaths by time of the day

outcome_during_day_plot <- patients %>%  
  mutate(time_h_m = hms(format(discharge_time, format = "%H:%M:%S"))) %>% 
  mutate(time_h_m = hour(time_h_m) + minute(time_h_m) / 60) %>%
  filter(outcome == "Died") %>%
  ggplot(aes(x = time_h_m)) + 
  # A constant colour belongs outside aes(); inside aes() it is treated as a data value
  geom_histogram(binwidth = 1.2, fill = "blue") + 
  scale_x_continuous(breaks = seq(0, 24, by = 1)) + 
  labs(x = "Time of the day", y = "Number of dead cases") +
  theme(legend.position = "none")

ggplotly(outcome_during_day_plot)
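
The conversion of a timestamp to a decimal hour of the day can also be done without lubridate. The sketch below is a base R equivalent under the assumption of a fixed time zone ("UTC"); the two timestamps are taken from the patients table and the variable names are illustrative.

```r
# Two discharge timestamps from the patients table; tz = "UTC" is an assumption.
discharge_time <- as.POSIXct(c("2020-02-17 12:40:09", "2020-02-07 09:06:58"), tz = "UTC")

# POSIXlt exposes the clock components directly.
lt <- as.POSIXlt(discharge_time)

# Hour of the day as a decimal number, e.g. 12:40 becomes 12 + 40/60.
time_h_m <- lt$hour + lt$min / 60
print(round(time_h_m, 2))
```

Seconds are ignored here, as in the report's own calculation, which is precise enough for bins that are more than an hour wide.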

Variables correlation

Preparing the dataset for correlation (changing factor variables to numeric).

cor_df <- last_sample_df %>%
            mutate(outcome = ifelse(outcome == "Died", 1, 0)) %>%
            mutate(gender = ifelse(gender == "male", 1, 0)) %>%
            rename(male = gender)

correlationMatrix <-  correlate(cor_df[sapply(cor_df, is.numeric)], use='pairwise.complete.obs')
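
The option use = 'pairwise.complete.obs' deserves a short illustration, because the biomarker columns contain many missing values. The sketch below uses base R cor() on a hypothetical toy data frame: each pair of columns is correlated on the rows where both values are present, so different cells of the matrix can be based on different numbers of patients.

```r
# Toy data with scattered missing values (hypothetical numbers).
toy <- data.frame(
  outcome = c(1, 1, 0, 0, 1, 0),
  albumin = c(25, 28, 36, NA, 27, 38),
  hs_CRP  = c(120, NA, 5, 10, 90, NA)
)

# Each pair of columns uses only the rows where both are observed.
cor_matrix <- cor(toy, use = "pairwise.complete.obs")
print(round(cor_matrix, 2))

# Number of complete pairs behind each coefficient.
observed <- !is.na(as.matrix(toy))
pair_counts <- crossprod(observed)
print(pair_counts)
```

Checking pair_counts alongside the coefficients helps to spot correlations that rest on only a handful of patients.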

Age correlation

From the previous analysis, it is known that elderly people are more likely to die of Covid-19. Below is a short summary of the biomarkers that are most highly correlated with age.

age_correlation <- correlationMatrix %>% 
  focus(age) %>% 
  mutate(age = abs(age)) %>%
  arrange(desc(age)) %>% 
  filter(rowname != "outcome") %>% head 

age_correlation %>% kbl() %>% kable_paper("hover")
rowname age
eGFR 0.6119405
(%)lymphocyte 0.5171992
neutrophils(%) 0.4885978
albumin 0.4870837
hs_CRP 0.4299006
neutrophils count 0.4073328
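
Ranking by absolute value, as in the table above, hides whether a marker rises or falls with age. The sketch below shows one way to keep the sign next to the magnitude; the marker names and correlation values are hypothetical and only illustrate the technique.

```r
# Hypothetical signed correlations with age (illustrative values only).
age_cor <- c(marker_a = -0.61, marker_b = 0.12, marker_c = 0.43, marker_d = -0.49)

# Rank by magnitude, strongest first.
ranked <- sort(abs(age_cor), decreasing = TRUE)
top3 <- names(ranked)[1:3]

# Keep the direction of each relationship alongside its strength.
ranking <- data.frame(
  marker    = top3,
  abs_cor   = unname(ranked[top3]),
  direction = ifelse(age_cor[top3] < 0, "decreases with age", "increases with age")
)
print(ranking)
```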

The most correlated marker is eGFR, which measures how effectively the kidneys work. It is hard to give a single normal value, because this marker depends on many factors such as gender, age and body mass, but some sources state that a value above 90 is normal. Values that are too low, and in some cases too high, indicate kidney diseases that affect blood filtration.

The chart below presents the eGFR value for patients of different ages, grouped by outcome. It shows that many elderly patients who died had some abnormalities in kidney function.

ggplot(last_sample_df, aes(x = age, y = `eGFR`, color = outcome)) + 
  geom_point() + 
  theme(legend.position = c(0.9,0.9)) + 
  ylim(0 , 150)

The next two highly correlated biomarkers are related to the immune system. The values of lymphocytes and neutrophils show how strong the organism is and how well it fights the disease.

Lymphocytes are cells responsible for protecting the body from viruses, bacteria and other disease-causing factors, among other things by creating antibodies. The normal range for an adult is 15 to 40%. A lower lymphocyte level means that the body cannot fight the disease effectively. The left chart below suggests that elderly people have a weaker immune system, which makes it harder for their organism to fight the disease.

plot1 <- ggplot(last_sample_df, aes(x = age, y = `(%)lymphocyte`, color = outcome)) + geom_point() + theme(legend.position = "none")
plot2 <- ggplot(last_sample_df, aes(x = `lymphocyte count`, y = `(%)lymphocyte`, color = outcome)) + 
  geom_point() + 
  theme(legend.position = c(0.8, 0.2)) +
  xlim(0,3.75)
grid.arrange(plot1, plot2, ncol=2)

Neutrophils are an essential part of the immune system: these cells search for pathogens in the organism and destroy them. A high neutrophils(%) value goes together with a high neutrophil count in the blood (right plot below), which indicates that a medical condition is present in the patient's body and that the immune system is fighting it.

This correlation suggests that elderly people are more vulnerable and that their immune systems need to produce more neutrophils to fight pathogens than those of younger patients. The left plot shows that some of the tested patients had an increased amount of neutrophils, which points to a medical condition. Adding the information about the outcome supports the finding that elderly patients are more likely to die of Covid-19.

plot1 <- ggplot(last_sample_df, aes(x = age, y = `neutrophils(%)`, color = outcome)) + geom_point() + theme(legend.position = "none")
plot2 <- ggplot(last_sample_df, aes(x = `neutrophils count`, y = `neutrophils(%)`, color = outcome)) + geom_point() + theme(legend.position = c(0.8, 0.2))
grid.arrange(plot1, plot2, ncol=2)

Outcome correlation

The following section examines the correlation between biomarkers and the outcome.

The correlation matrix for the highest correlated variables and the numeric correlation values are shown below.

'%ni%' <- Negate('%in%')

outcome_cor <- correlationMatrix %>% 
  focus(outcome) %>% 
  mutate(outcome = abs(outcome)) %>%
  arrange(desc(outcome)) %>% 
  filter(`rowname` %ni% c('neutrophils(%)', 'neutrophils count')) %>% 
  mutate(outcome = round(outcome,2)) %>%
  filter(outcome > 0.5)
  

# all_of() makes it explicit that the column names come from an external vector
outcome_corr_df <- cor_df %>% select(all_of(outcome_cor$rowname), outcome) 

outcome_cor_matrix <- cor(outcome_corr_df[sapply(outcome_corr_df, is.numeric)], use='pairwise.complete.obs')

corrplot(outcome_cor_matrix)

The previous section already analysed lymphocytes and their importance in fighting the disease, so they are not considered again in this section.

outcome_cor %>% kbl() %>% kable_paper("hover") 
rowname outcome
(%)lymphocyte 0.76
hs_CRP 0.72
albumin 0.72
Lactate dehydrogenase 0.69
Prothrombin activity 0.68
D-D dimer 0.68
Fibrin degradation products 0.66
calcium 0.64
Platelet count 0.58
age 0.56
eosinophils(%) 0.55
HCO3- 0.54
thrombocytocrit 0.53
monocytes(%) 0.51

The tabs below present the values of each biomarker (correlation > 0.65) for all patients, grouped by age and outcome. The data show that all of these biomarkers are also correlated with age to some degree: for these 5 biomarkers, the values for elderly patients very often stand out from the values for people younger than 50. This is confirmed by the boxplots below every chart, which present the distribution of each biomarker grouped by age group (adult: younger than 64 years, elderly: 64 years or older), gender and outcome.
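
The age grouping used for the boxplots can be sketched in base R. The ages below are hypothetical; the cut-off of 64 follows the report's code, where a patient aged exactly 64 falls into the elderly group. The cut() variant is shown as an equivalent alternative, not as the report's method.

```r
# Hypothetical ages, including the boundary value 64 and a missing age.
age <- c(45, 63, 64, 80, NA)

# Same rule as in the report: younger than 64 is "adult", otherwise "elderly".
age_group <- factor(ifelse(age < 64, "adult", "elderly"),
                    levels = c("adult", "elderly"))

# Equivalent grouping with cut(); right = FALSE makes the intervals [-Inf, 64) and [64, Inf).
age_group_cut <- cut(age, breaks = c(-Inf, 64, Inf), right = FALSE,
                     labels = c("adult", "elderly"))

print(table(age_group, useNA = "ifany"))
```

cut() scales more easily if more than two age bands are needed later, since only the breaks and labels change.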

layout_ggplotly <- function(gg, x = -0.02, y = -0.05){
  # The 1 and 2 goes into the list that contains the options for the x and y axis labels respectively
  gg[['x']][['layout']][['annotations']][[1]][['y']] <- x
  gg[['x']][['layout']][['annotations']][[2]][['x']] <- y
  gg
}

Albumin

ggplot(last_sample_df, aes(x = age, y = `albumin`, color = outcome)) + geom_point()

albumin_plot <- last_sample_df %>% 
  mutate(age_group = as.factor(ifelse(last_sample_df$age < 64, 'adult', 'elderly'))) %>% 
  ggplot(aes(x= age_group, y = `albumin`, fill = gender)) +
  geom_boxplot(na.rm=TRUE) +  facet_grid(~outcome) + 
  labs(x = "Age group", y = "Albumin")

ggplotly(albumin_plot) %>% layout(boxmode = "group") %>% layout_ggplotly

Prothrombin activity

ggplot(last_sample_df, aes(x = age, y = `Prothrombin activity`, color = outcome)) + geom_point()

pt_plot <- last_sample_df %>% 
  mutate(age_group = as.factor(ifelse(last_sample_df$age < 64, 'adult', 'elderly'))) %>% 
  ggplot(aes(x= age_group, y = `Prothrombin activity`, fill = gender)) +
  geom_boxplot(na.rm=TRUE) +  facet_grid(~outcome) + 
  labs(x = "Age group", y = "Prothrombin activity")

ggplotly(pt_plot) %>% layout(boxmode = "group") %>% layout_ggplotly

Hs-CRP

This analysis takes an hs-CRP value of about 50 as the reference level (clinical reference ranges for C-reactive protein are usually much lower, typically below about 10 mg/L). Values above that level indicate some kind of inflammation in the body. Many values in the first plot are far above it, showing very strong inflammation which probably contributed to death.

ggplot(last_sample_df, aes(x = age, y = hs_CRP, color = outcome)) + geom_point()

crp_plot <- last_sample_df %>%
  mutate(age_group = as.factor(ifelse(last_sample_df$age < 64, 'adult', 'elderly'))) %>%
  ggplot(aes(x= age_group, y = hs_CRP, fill = gender)) +
  geom_boxplot(na.rm=TRUE) +  facet_grid(~outcome) +
  labs(x = "Age group", y = "High sensitivity C-reactive protein")

ggplotly(crp_plot) %>% layout(boxmode = "group") %>% layout_ggplotly

D-D dimer

D-dimer is a protein fragment released when a blood clot is broken down, not a cell. A high value means that blood clots have been forming and dissolving in the organism. It can be linked with conditions such as myocardial infarction or pulmonary embolism, which combined with Covid-19 symptoms can lead to death.

ggplot(last_sample_df, aes(x = age, y = `D-D dimer`, color = outcome)) + geom_point()

dimer_plot <- last_sample_df %>% 
  mutate(age_group = as.factor(ifelse(last_sample_df$age < 64, 'adult', 'elderly'))) %>% 
  ggplot(aes(x= age_group, y = `D-D dimer`, fill = gender)) +
  geom_boxplot(na.rm=TRUE) +  facet_grid(~outcome) + 
  labs(x = "Age group", y = "D-D dimer")

ggplotly(dimer_plot) %>% layout(boxmode = "group") %>% layout_ggplotly

Lactate dehydrogenase

ggplot(last_sample_df, aes(x = age, y = `Lactate dehydrogenase`, color = outcome)) + geom_point()

ldh_plot <- last_sample_df %>% 
  mutate(age_group = as.factor(ifelse(last_sample_df$age < 64, 'adult', 'elderly'))) %>% 
  ggplot(aes(x= age_group, y = `Lactate dehydrogenase`, fill = gender)) +
  geom_boxplot(na.rm=TRUE) +  facet_grid(~outcome) + 
  labs(x = "Age group", y = "Lactate dehydrogenase")

ggplotly(ldh_plot) %>% layout(boxmode = "group") %>% layout_ggplotly

Animation

The animated cumulative number of deaths over consecutive days is presented below. A sharp rise can be seen between 02.02.2020 and 22.02.2020. After that the deaths levelled off, and another peak occurred on 04.04.2020.

patients_agg <- patients %>% select(c(discharge_time, outcome)) %>%
  mutate(discharge_time = as.Date(discharge_time)) %>%
  filter(outcome == 'Died') %>%
  group_by(discharge_time) %>%
  summarise(deaths_count = n(), .groups="drop") %>%
  arrange(discharge_time) %>%
  mutate(deaths_count_agg = cumsum(deaths_count))

ggplot(patients_agg, aes(x = discharge_time, y = deaths_count_agg)) + 
  geom_line(size  = 1.1, color = 'red') + 
  transition_reveal(discharge_time) + 
  labs(x = "Discharge time", y = "Deaths count aggregate") + 
  scale_x_date(breaks = seq(min(patients_agg$discharge_time), max(patients_agg$discharge_time), by = 10))
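
The aggregation behind the animation (deaths per day, then a running total) can be sketched in base R. The discharge dates below are hypothetical and only illustrate the steps.

```r
# Hypothetical discharge dates of patients who died.
discharge <- as.Date(c("2020-02-01", "2020-02-01", "2020-02-03",
                       "2020-02-04", "2020-02-04", "2020-02-04"))

# Deaths per day: table() counts the occurrences of each date.
per_day <- as.data.frame(table(discharge_time = discharge), stringsAsFactors = FALSE)
per_day$discharge_time <- as.Date(per_day$discharge_time)

# Sort chronologically before accumulating, otherwise the running total is meaningless.
per_day <- per_day[order(per_day$discharge_time), ]
per_day$deaths_count_agg <- cumsum(per_day$Freq)
print(per_day)
```

Days without any death do not appear as rows here, so a line drawn through these points stays flat across such gaps instead of showing explicit zero-death days.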

Classification model

In this chapter a classification model is trained to predict the outcome (death/survival) of COVID-19 patients based on basic patient observations and blood test samples. One blood test per patient is treated as an observation for the machine learning algorithm. As explained earlier, extra data preprocessing was needed to prepare the dataset: for each patient, all the blood tests are reduced to one row containing the value closest to the discharge time.

Redundant columns like patient id, blood test time and admission and discharge time were removed from the dataset.

Dataset cleaning

For the machine learning process there should be no missing values in the dataset. The summary below shows that there are columns with many missing values.

class_df %>% select(-c(age,gender, outcome)) %>% summary %>% kbl %>% kable_paper("hover") %>% scroll_box(width = "100%", height = "300px")
Hypersensitive cardiac troponinI hemoglobin Serum chloride Prothrombin time procalcitonin eosinophils(%) Interleukin 2 receptor Alkaline phosphatase albumin basophil(%) Interleukin 10 Total bilirubin Platelet count monocytes(%) antithrombin Interleukin 8 indirect bilirubin Red blood cell distribution width neutrophils(%) total protein Quantification of Treponema pallidum antibodies Prothrombin activity HBsAg mean corpuscular volume hematocrit White blood cell count Tumor necrosis factor alpha mean corpuscular hemoglobin concentration fibrinogen Interleukin 1 beta Urea lymphocyte count PH value Red blood cell count Eosinophil count Corrected calcium Serum potassium glucose neutrophils count Direct bilirubin Mean platelet volume ferritin RBC distribution width SD Thrombin time (%)lymphocyte HCV antibody quantification D-D dimer Total cholesterol aspartate aminotransferase Uric acid HCO3- calcium Amino-terminal brain natriuretic peptide precursor(NT-proBNP) Lactate dehydrogenase platelet large cell ratio Interleukin 6 Fibrin degradation products monocytes count PLT distribution width globulin Gamma glutamyl transpeptidase International standard ratio basophil count(#) 2019-nCoV nucleic acid detection mean corpuscular hemoglobin Activation of partial thromboplastin time hs_CRP HIV antibody quantification serum sodium thrombocytocrit ESR glutamic-pyruvic transaminase eGFR creatinine
Min. : 1.90 Min. : 6.4 Min. : 77.70 Min. :11.50 Min. : 0.020 Min. :0.000 Min. : 61.0 Min. : 17.00 Min. :13.60 Min. :0.0000 Min. : 5.00 Min. : 2.80 Min. : -1 Min. : 0.600 Min. : 20.00 Min. : 5.00 Min. : 0.100 Min. :10.60 Min. : 1.90 Min. :31.80 Min. : 0.0200 Min. : 7.00 Min. : 0.000 Min. : 62.30 Min. :15.60 Min. : 0.71 Min. : 4.000 Min. :286.0 Min. :0.500 Min. : 5.000 Min. : 1.70 Min. : 0.050 Min. :5.000 Min. : 0.100 Min. :0.00000 Min. :1.650 Min. :2.760 Min. : 1.000 Min. : 0.320 Min. : 1.600 Min. : 8.50 Min. : 17.8 Min. : 31.30 Min. : 13.00 Min. : 0.300 Min. :0.0200 Min. : 0.2100 Min. :0.100 Min. : 6.00 Min. : 52.0 Min. : 6.30 Min. :1.170 Min. : 5.0 Min. : 110.0 Min. :11.20 Min. : 1.500 Min. : 4.00 Min. : 0.010 Min. : 8.10 Min. :10.10 Min. : 7.00 Min. : 0.840 Min. :0.0000 Min. :-1 Min. :20.80 Min. : 21.80 Min. : 0.10 Min. :0.05000 Min. :121.1 Min. :0.0100 Min. : 1.0 Min. : 5.00 Min. : 2.00 Min. : 14.0
1st Qu.: 2.45 1st Qu.:112.0 1st Qu.: 99.53 1st Qu.:13.40 1st Qu.: 0.030 1st Qu.:0.000 1st Qu.: 457.8 1st Qu.: 54.00 1st Qu.:28.20 1st Qu.:0.1000 1st Qu.: 5.00 1st Qu.: 7.20 1st Qu.:113 1st Qu.: 2.975 1st Qu.: 76.25 1st Qu.: 8.10 1st Qu.: 3.700 1st Qu.:12.03 1st Qu.:61.73 1st Qu.:61.20 1st Qu.: 0.0400 1st Qu.: 67.00 1st Qu.: 0.000 1st Qu.: 86.90 1st Qu.:33.00 1st Qu.: 5.12 1st Qu.: 6.675 1st Qu.:332.0 1st Qu.:3.183 1st Qu.: 5.000 1st Qu.: 3.80 1st Qu.: 0.520 1st Qu.:6.000 1st Qu.: 3.550 1st Qu.:0.00000 1st Qu.:2.260 1st Qu.:4.032 1st Qu.: 5.120 1st Qu.: 3.100 1st Qu.: 3.100 1st Qu.:10.10 1st Qu.: 402.0 1st Qu.: 38.80 1st Qu.: 15.60 1st Qu.: 4.175 1st Qu.:0.0400 1st Qu.: 0.4925 1st Qu.:2.950 1st Qu.: 19.00 1st Qu.: 198.8 1st Qu.:20.90 1st Qu.:1.990 1st Qu.: 58.5 1st Qu.: 199.0 1st Qu.:25.32 1st Qu.: 3.955 1st Qu.: 4.00 1st Qu.: 0.310 1st Qu.:10.93 1st Qu.:28.98 1st Qu.: 21.00 1st Qu.: 1.018 1st Qu.:0.0100 1st Qu.:-1 1st Qu.:29.70 1st Qu.: 35.15 1st Qu.: 2.00 1st Qu.:0.07000 1st Qu.:138.3 1st Qu.:0.1400 1st Qu.: 13.0 1st Qu.: 17.00 1st Qu.: 66.70 1st Qu.: 58.0
Median : 12.30 Median :125.0 Median :102.30 Median :14.30 Median : 0.100 Median :0.250 Median : 663.5 Median : 71.00 Median :33.20 Median :0.2000 Median : 5.20 Median : 10.60 Median :192 Median : 6.250 Median : 87.00 Median : 14.75 Median : 5.300 Median :12.75 Median :77.55 Median :66.00 Median : 0.0500 Median : 86.50 Median : 0.010 Median : 90.40 Median :36.30 Median : 7.93 Median : 8.300 Median :342.0 Median :4.220 Median : 5.000 Median : 5.40 Median : 0.990 Median :6.000 Median : 4.100 Median :0.02000 Median :2.370 Median :4.430 Median : 6.540 Median : 5.380 Median : 4.800 Median :10.80 Median : 759.7 Median : 41.20 Median : 16.55 Median :14.350 Median :0.0600 Median : 1.3300 Median :3.720 Median : 25.00 Median : 260.0 Median :23.90 Median :2.110 Median : 304.0 Median : 273.5 Median :30.85 Median : 18.010 Median : 5.80 Median : 0.430 Median :12.50 Median :32.40 Median : 33.00 Median : 1.095 Median :0.0200 Median :-1 Median :30.90 Median : 38.90 Median : 26.30 Median :0.09000 Median :140.7 Median :0.2100 Median : 28.0 Median : 26.00 Median : 89.35 Median : 74.0
Mean : 795.91 Mean :124.3 Mean :103.30 Mean :16.04 Mean : 1.095 Mean :0.902 Mean : 934.6 Mean : 85.62 Mean :32.67 Mean :0.2646 Mean : 12.89 Mean : 16.50 Mean :193 Mean : 6.525 Mean : 86.36 Mean : 95.37 Mean : 6.757 Mean :13.22 Mean :75.39 Mean :65.28 Mean : 0.1332 Mean : 81.25 Mean : 8.427 Mean : 90.61 Mean :36.58 Mean : 18.93 Mean : 11.929 Mean :342.1 Mean :4.305 Mean : 6.716 Mean : 9.88 Mean : 1.166 Mean :6.348 Mean : 8.449 Mean :0.05379 Mean :2.347 Mean :4.500 Mean : 8.525 Mean : 8.001 Mean : 9.767 Mean :10.98 Mean : 1519.3 Mean : 42.83 Mean : 17.72 Mean :16.913 Mean :0.1119 Mean : 6.2456 Mean :3.748 Mean : 54.22 Mean : 296.1 Mean :23.20 Mean :2.096 Mean : 3772.4 Mean : 476.5 Mean :32.22 Mean : 127.050 Mean : 46.73 Mean : 0.596 Mean :13.23 Mean :32.58 Mean : 49.44 Mean : 1.298 Mean :0.0214 Mean :-1 Mean :31.01 Mean : 41.27 Mean : 64.86 Mean :0.09931 Mean :141.8 Mean :0.2131 Mean : 33.6 Mean : 42.66 Mean : 81.74 Mean : 119.7
3rd Qu.: 79.85 3rd Qu.:138.0 3rd Qu.:105.58 3rd Qu.:16.30 3rd Qu.: 0.450 3rd Qu.:1.500 3rd Qu.:1172.5 3rd Qu.: 98.00 3rd Qu.:37.62 3rd Qu.:0.4000 3rd Qu.: 11.90 3rd Qu.: 16.12 3rd Qu.:257 3rd Qu.: 8.900 3rd Qu.: 98.00 3rd Qu.: 34.42 3rd Qu.: 7.900 3rd Qu.:13.80 3rd Qu.:91.92 3rd Qu.:70.42 3rd Qu.: 0.0700 3rd Qu.: 98.00 3rd Qu.: 0.010 3rd Qu.: 94.22 3rd Qu.:40.12 3rd Qu.: 13.20 3rd Qu.: 11.600 3rd Qu.:349.0 3rd Qu.:5.410 3rd Qu.: 5.000 3rd Qu.:11.53 3rd Qu.: 1.540 3rd Qu.:7.000 3rd Qu.: 4.650 3rd Qu.:0.09000 3rd Qu.:2.450 3rd Qu.:4.817 3rd Qu.: 9.915 3rd Qu.:11.242 3rd Qu.: 7.425 3rd Qu.:11.60 3rd Qu.: 1436.6 3rd Qu.: 45.27 3rd Qu.: 17.90 3rd Qu.:27.525 3rd Qu.:0.0900 3rd Qu.:12.0175 3rd Qu.:4.380 3rd Qu.: 41.00 3rd Qu.: 349.1 3rd Qu.:26.32 3rd Qu.:2.220 3rd Qu.: 1921.0 3rd Qu.: 617.8 3rd Qu.:37.75 3rd Qu.: 61.123 3rd Qu.:101.78 3rd Qu.: 0.610 3rd Qu.:14.50 3rd Qu.:35.80 3rd Qu.: 55.00 3rd Qu.: 1.302 3rd Qu.:0.0300 3rd Qu.:-1 3rd Qu.:32.20 3rd Qu.: 44.20 3rd Qu.: 99.10 3rd Qu.:0.11000 3rd Qu.:143.3 3rd Qu.:0.2775 3rd Qu.: 47.0 3rd Qu.: 42.00 3rd Qu.:105.00 3rd Qu.: 97.0
Max. :50000.00 Max. :178.0 Max. :140.40 Max. :92.10 Max. :57.170 Max. :8.600 Max. :7500.0 Max. :620.00 Max. :47.60 Max. :1.7000 Max. :500.00 Max. :295.40 Max. :554 Max. :53.000 Max. :136.00 Max. :6795.00 Max. :59.700 Max. :27.10 Max. :98.90 Max. :83.40 Max. :11.9500 Max. :142.00 Max. :250.000 Max. :117.60 Max. :52.30 Max. :1726.60 Max. :168.000 Max. :488.0 Max. :8.950 Max. :88.500 Max. :68.40 Max. :33.690 Max. :7.565 Max. :749.500 Max. :0.46000 Max. :2.790 Max. :9.860 Max. :38.820 Max. :32.220 Max. :242.900 Max. :15.00 Max. :50000.0 Max. :113.30 Max. :144.90 Max. :48.500 Max. :2.0900 Max. :21.0000 Max. :7.300 Max. :1858.00 Max. :1176.0 Max. :33.80 Max. :2.600 Max. :70000.0 Max. :1867.0 Max. :62.20 Max. :5000.000 Max. :190.80 Max. :39.920 Max. :25.30 Max. :49.20 Max. :732.00 Max. :11.570 Max. :0.1200 Max. :-1 Max. :50.80 Max. :106.40 Max. :320.00 Max. :0.27000 Max. :179.5 Max. :0.5100 Max. :110.0 Max. :1508.00 Max. :206.90 Max. :1497.0
NA’s :74 NA’s :5 NA’s :7 NA’s :9 NA’s :48 NA’s :5 NA’s :145 NA’s :5 NA’s :5 NA’s :5 NA’s :146 NA’s :5 NA’s :5 NA’s :5 NA’s :159 NA’s :145 NA’s :6 NA’s :11 NA’s :5 NA’s :5 NA’s :86 NA’s :9 NA’s :86 NA’s :5 NA’s :5 NA’s :4 NA’s :145 NA’s :5 NA’s :63 NA’s :145 NA’s :5 NA’s :5 NA’s :129 NA’s :4 NA’s :5 NA’s :8 NA’s :7 NA’s :10 NA’s :5 NA’s :5 NA’s :15 NA’s :148 NA’s :11 NA’s :63 NA’s :5 NA’s :86 NA’s :19 NA’s :5 NA’s :5 NA’s :5 NA’s :5 NA’s :7 NA’s :94 NA’s :5 NA’s :15 NA’s :143 NA’s :159 NA’s :5 NA’s :15 NA’s :5 NA’s :5 NA’s :9 NA’s :5 NA’s :143 NA’s :5 NA’s :63 NA’s :8 NA’s :87 NA’s :7 NA’s :15 NA’s :73 NA’s :5 NA’s :5 NA’s :5

The cleaning below finds patients who have basic information such as age and gender but are missing many biomarker values (35 or more); these patients are removed from the dataset.

# Delete rows (patients) with too many missing values
row_na_sum <- rowSums(is.na(class_df))
rows_to_delete <- which(row_na_sum >= 35)
patients_to_delete <- length(rows_to_delete)

# Logical indexing stays correct even when no row reaches the threshold
class_df <- class_df[row_na_sum < 35, ]
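
Removing rows by their number of missing values can be sketched on a toy data frame. The data and the threshold of 2 are hypothetical (the report uses 35). The sketch also shows a pitfall of negative indexing: when the vector of rows to delete is empty, df[-idx, ] returns no rows at all instead of the full data frame.

```r
# Toy data frame: row 2 is entirely missing.
toy <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 6), c = c(7, NA, 9))

# Number of missing values in each row.
na_per_row <- rowSums(is.na(toy))

# Keep rows below the threshold (2 here, purely for illustration).
cleaned <- toy[na_per_row < 2, ]
print(cleaned)

# Pitfall: an empty index negated is still empty, so every row is dropped.
none <- integer(0)
rows_left <- nrow(toy[-none, ])
print(rows_left)
```

Filtering with a logical condition avoids this edge case, because a condition that is TRUE for every row simply keeps every row.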

class_df %>% select(-c(age,gender, outcome)) %>% summary %>% kbl %>% kable_paper("hover") %>% scroll_box(width = "100%", height = "300px")
Hypersensitive cardiac troponinI hemoglobin Serum chloride Prothrombin time procalcitonin eosinophils(%) Interleukin 2 receptor Alkaline phosphatase albumin basophil(%) Interleukin 10 Total bilirubin Platelet count monocytes(%) antithrombin Interleukin 8 indirect bilirubin Red blood cell distribution width neutrophils(%) total protein Quantification of Treponema pallidum antibodies Prothrombin activity HBsAg mean corpuscular volume hematocrit White blood cell count Tumor necrosis factor alpha mean corpuscular hemoglobin concentration fibrinogen Interleukin 1 beta Urea lymphocyte count PH value Red blood cell count Eosinophil count Corrected calcium Serum potassium glucose neutrophils count Direct bilirubin Mean platelet volume ferritin RBC distribution width SD Thrombin time (%)lymphocyte HCV antibody quantification D-D dimer Total cholesterol aspartate aminotransferase Uric acid HCO3- calcium Amino-terminal brain natriuretic peptide precursor(NT-proBNP) Lactate dehydrogenase platelet large cell ratio Interleukin 6 Fibrin degradation products monocytes count PLT distribution width globulin Gamma glutamyl transpeptidase International standard ratio basophil count(#) 2019-nCoV nucleic acid detection mean corpuscular hemoglobin Activation of partial thromboplastin time hs_CRP HIV antibody quantification serum sodium thrombocytocrit ESR glutamic-pyruvic transaminase eGFR creatinine
Min. : 1.90 Min. : 6.4 Min. : 77.7 Min. :11.50 Min. : 0.020 Min. :0.0000 Min. : 61.0 Min. : 17.00 Min. :13.60 Min. :0.0000 Min. : 5.00 Min. : 2.80 Min. : -1.0 Min. : 0.600 Min. : 20.00 Min. : 5.00 Min. : 0.100 Min. :10.60 Min. : 1.90 Min. :31.80 Min. : 0.0200 Min. : 7.00 Min. : 0.000 Min. : 62.30 Min. :15.60 Min. : 0.710 Min. : 4.000 Min. :286.0 Min. :0.50 Min. : 5.000 Min. : 1.70 Min. : 0.050 Min. :5.000 Min. : 0.100 Min. :0.00000 Min. :1.650 Min. :2.760 Min. : 1.000 Min. : 0.320 Min. : 1.600 Min. : 8.50 Min. : 17.8 Min. : 31.30 Min. : 13.00 Min. : 0.30 Min. :0.0200 Min. : 0.210 Min. :0.100 Min. : 6.00 Min. : 52.0 Min. : 6.30 Min. :1.170 Min. : 5 Min. : 110.0 Min. :11.20 Min. : 1.500 Min. : 4.00 Min. : 0.0100 Min. : 8.10 Min. :10.10 Min. : 7.00 Min. : 0.840 Min. :0.00000 Min. :-1 Min. :20.80 Min. : 21.80 Min. : 0.10 Min. :0.05000 Min. :121.1 Min. :0.010 Min. : 1.00 Min. : 5.00 Min. : 2.00 Min. : 14.0
1st Qu.: 2.45 1st Qu.:112.0 1st Qu.: 99.6 1st Qu.:13.40 1st Qu.: 0.030 1st Qu.:0.0000 1st Qu.: 457.8 1st Qu.: 54.00 1st Qu.:28.20 1st Qu.:0.1000 1st Qu.: 5.00 1st Qu.: 7.20 1st Qu.:113.0 1st Qu.: 2.950 1st Qu.: 76.00 1st Qu.: 8.10 1st Qu.: 3.725 1st Qu.:12.00 1st Qu.:61.85 1st Qu.:61.20 1st Qu.: 0.0400 1st Qu.: 67.00 1st Qu.: 0.000 1st Qu.: 86.95 1st Qu.:33.00 1st Qu.: 5.115 1st Qu.: 6.675 1st Qu.:332.0 1st Qu.:3.18 1st Qu.: 5.000 1st Qu.: 3.80 1st Qu.: 0.520 1st Qu.:6.000 1st Qu.: 3.550 1st Qu.:0.00000 1st Qu.:2.260 1st Qu.:4.030 1st Qu.: 5.120 1st Qu.: 3.110 1st Qu.: 3.100 1st Qu.:10.10 1st Qu.: 402.0 1st Qu.: 38.80 1st Qu.: 15.60 1st Qu.: 4.15 1st Qu.:0.0400 1st Qu.: 0.490 1st Qu.:2.950 1st Qu.: 19.00 1st Qu.: 198.5 1st Qu.:20.90 1st Qu.:1.990 1st Qu.: 57 1st Qu.: 198.0 1st Qu.:25.30 1st Qu.: 3.955 1st Qu.: 4.00 1st Qu.: 0.3100 1st Qu.:11.00 1st Qu.:28.95 1st Qu.: 21.00 1st Qu.: 1.015 1st Qu.:0.01000 1st Qu.:-1 1st Qu.:29.70 1st Qu.: 35.10 1st Qu.: 2.00 1st Qu.:0.07000 1st Qu.:138.3 1st Qu.:0.140 1st Qu.: 13.50 1st Qu.: 17.00 1st Qu.: 66.75 1st Qu.: 58.0
Median : 12.30 Median :125.0 Median :102.3 Median :14.30 Median : 0.095 Median :0.2000 Median : 663.5 Median : 71.00 Median :33.20 Median :0.2000 Median : 5.20 Median : 10.60 Median :190.0 Median : 6.200 Median : 87.00 Median : 14.75 Median : 5.300 Median :12.70 Median :77.80 Median :66.00 Median : 0.0500 Median : 86.00 Median : 0.010 Median : 90.40 Median :36.30 Median : 7.930 Median : 8.300 Median :342.0 Median :4.22 Median : 5.000 Median : 5.40 Median : 0.990 Median :6.000 Median : 4.100 Median :0.02000 Median :2.370 Median :4.430 Median : 6.540 Median : 5.390 Median : 4.800 Median :10.80 Median : 759.7 Median : 41.20 Median : 16.50 Median :14.20 Median :0.0600 Median : 1.330 Median :3.720 Median : 25.00 Median : 260.0 Median :23.90 Median :2.110 Median : 290 Median : 274.0 Median :30.90 Median : 18.010 Median : 5.80 Median : 0.4300 Median :12.50 Median :32.40 Median : 33.00 Median : 1.100 Median :0.02000 Median :-1 Median :30.90 Median : 38.90 Median : 26.50 Median :0.09000 Median :140.7 Median :0.210 Median : 28.00 Median : 26.00 Median : 89.40 Median : 74.0
Mean : 795.91 Mean :124.4 Mean :103.3 Mean :16.05 Mean : 1.098 Mean :0.8994 Mean : 934.6 Mean : 85.68 Mean :32.65 Mean :0.2631 Mean : 12.89 Mean : 16.54 Mean :192.9 Mean : 6.519 Mean : 86.32 Mean : 95.37 Mean : 6.771 Mean :13.21 Mean :75.47 Mean :65.25 Mean : 0.1335 Mean : 81.22 Mean : 8.489 Mean : 90.62 Mean :36.58 Mean : 18.974 Mean : 11.929 Mean :342.1 Mean :4.30 Mean : 6.716 Mean : 9.85 Mean : 1.163 Mean :6.347 Mean : 8.326 Mean :0.05369 Mean :2.347 Mean :4.501 Mean : 8.525 Mean : 8.017 Mean : 9.788 Mean :10.98 Mean : 1519.3 Mean : 42.83 Mean : 17.72 Mean :16.84 Mean :0.1116 Mean : 6.262 Mean :3.745 Mean : 54.35 Mean : 295.4 Mean :23.20 Mean :2.095 Mean : 3757 Mean : 477.3 Mean :32.24 Mean : 127.050 Mean : 46.95 Mean : 0.5964 Mean :13.24 Mean :32.56 Mean : 49.53 Mean : 1.299 Mean :0.02135 Mean :-1 Mean :31.01 Mean : 41.27 Mean : 64.98 Mean :0.09912 Mean :141.8 Mean :0.213 Mean : 33.68 Mean : 42.76 Mean : 81.95 Mean : 117.9
3rd Qu.: 79.85 3rd Qu.:138.0 3rd Qu.:105.6 3rd Qu.:16.30 3rd Qu.: 0.450 3rd Qu.:1.5000 3rd Qu.:1172.5 3rd Qu.: 98.00 3rd Qu.:37.60 3rd Qu.:0.4000 3rd Qu.: 11.90 3rd Qu.: 16.15 3rd Qu.:257.0 3rd Qu.: 8.900 3rd Qu.: 98.00 3rd Qu.: 34.42 3rd Qu.: 7.900 3rd Qu.:13.80 3rd Qu.:91.95 3rd Qu.:70.40 3rd Qu.: 0.0700 3rd Qu.: 98.00 3rd Qu.: 0.010 3rd Qu.: 94.25 3rd Qu.:40.15 3rd Qu.: 13.170 3rd Qu.: 11.600 3rd Qu.:349.0 3rd Qu.:5.41 3rd Qu.: 5.000 3rd Qu.:11.50 3rd Qu.: 1.540 3rd Qu.:7.000 3rd Qu.: 4.640 3rd Qu.:0.09000 3rd Qu.:2.450 3rd Qu.:4.820 3rd Qu.: 9.915 3rd Qu.:11.275 3rd Qu.: 7.450 3rd Qu.:11.60 3rd Qu.: 1436.6 3rd Qu.: 45.30 3rd Qu.: 17.90 3rd Qu.:27.50 3rd Qu.:0.0900 3rd Qu.:12.050 3rd Qu.:4.370 3rd Qu.: 41.00 3rd Qu.: 347.9 3rd Qu.:26.35 3rd Qu.:2.220 3rd Qu.: 1894 3rd Qu.: 618.5 3rd Qu.:37.80 3rd Qu.: 61.123 3rd Qu.:104.10 3rd Qu.: 0.6100 3rd Qu.:14.50 3rd Qu.:35.75 3rd Qu.: 55.00 3rd Qu.: 1.305 3rd Qu.:0.03000 3rd Qu.:-1 3rd Qu.:32.20 3rd Qu.: 44.20 3rd Qu.: 99.12 3rd Qu.:0.11000 3rd Qu.:143.3 3rd Qu.:0.280 3rd Qu.: 47.00 3rd Qu.: 42.00 3rd Qu.:105.00 3rd Qu.: 97.0
Max. :50000.00 Max. :178.0 Max. :140.4 Max. :92.10 Max. :57.170 Max. :8.6000 Max. :7500.0 Max. :620.00 Max. :47.60 Max. :1.7000 Max. :500.00 Max. :295.40 Max. :554.0 Max. :53.000 Max. :136.00 Max. :6795.00 Max. :59.700 Max. :27.10 Max. :98.90 Max. :83.40 Max. :11.9500 Max. :142.00 Max. :250.000 Max. :117.60 Max. :52.30 Max. :1726.600 Max. :168.000 Max. :488.0 Max. :8.95 Max. :88.500 Max. :68.40 Max. :33.690 Max. :7.565 Max. :749.500 Max. :0.46000 Max. :2.790 Max. :9.860 Max. :38.820 Max. :32.220 Max. :242.900 Max. :15.00 Max. :50000.0 Max. :113.30 Max. :144.90 Max. :48.50 Max. :2.0900 Max. :21.000 Max. :7.300 Max. :1858.00 Max. :1176.0 Max. :33.80 Max. :2.600 Max. :70000 Max. :1867.0 Max. :62.20 Max. :5000.000 Max. :190.80 Max. :39.9200 Max. :25.30 Max. :49.20 Max. :732.00 Max. :11.570 Max. :0.12000 Max. :-1 Max. :50.80 Max. :106.40 Max. :320.00 Max. :0.27000 Max. :179.5 Max. :0.510 Max. :110.00 Max. :1508.00 Max. :206.90 Max. :1497.0
NA’s :69 NA’s :1 NA’s :3 NA’s :5 NA’s :44 NA’s :1 NA’s :140 NA’s :1 NA’s :1 NA’s :1 NA’s :141 NA’s :1 NA’s :1 NA’s :1 NA’s :155 NA’s :140 NA’s :2 NA’s :7 NA’s :1 NA’s :1 NA’s :83 NA’s :5 NA’s :83 NA’s :1 NA’s :1 NA’s :1 NA’s :140 NA’s :1 NA’s :59 NA’s :140 NA’s :1 NA’s :1 NA’s :127 NA’s :1 NA’s :1 NA’s :4 NA’s :3 NA’s :5 NA’s :1 NA’s :1 NA’s :11 NA’s :143 NA’s :7 NA’s :59 NA’s :1 NA’s :83 NA’s :15 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :3 NA’s :91 NA’s :1 NA’s :11 NA’s :138 NA’s :155 NA’s :1 NA’s :11 NA’s :1 NA’s :1 NA’s :5 NA’s :1 NA’s :140 NA’s :1 NA’s :59 NA’s :4 NA’s :84 NA’s :3 NA’s :11 NA’s :69 NA’s :1 NA’s :1 NA’s :1

Five patients are removed from the dataset because each has more than 40 missing values.

Many columns have a large number of missing values (69 or more in the summary above); they are not used for classification.

class_df <- class_df %>% 
  select(-c(`Interleukin 2 receptor`, 
            `Interleukin 10`, 
            `antithrombin`, 
            `Interleukin 8`, 
            `Quantification of Treponema pallidum antibodies`, 
            `HBsAg`, 
            `Tumor necrosis factor alpha`, 
            `Interleukin 1 beta`, 
            `PH value`, 
            `ferritin`, 
            `Amino-terminal brain natriuretic peptide precursor(NT-proBNP)`, 
            `Interleukin 6`, 
            `Fibrin degradation products`, 
            `2019-nCoV nucleic acid detection`, 
            `HIV antibody quantification`, 
            `Hypersensitive cardiac troponinI`, 
            `HCV antibody quantification`, 
            `ESR`))
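The columns to drop could also be found programmatically instead of being listed by hand. The following is a minimal base R sketch of that idea; the toy data frame `toy_df` and the cutoff `na_cutoff` are illustrative assumptions and not part of the original analysis. Judging from the NA counts in the summary above, a cutoff of 60 on the real data would select the same columns as the hand-written list.

```r
# Toy data frame standing in for class_df (hypothetical values)
toy_df <- data.frame(
  a = c(1, NA, 3, NA, 5),
  b = c(NA, NA, NA, 4, 5),
  c = c(1, 2, 3, 4, 5)
)

# Assumed cutoff: drop columns with more than 2 missing values
na_cutoff <- 2

# Count the missing values in each column
na_counts <- colSums(is.na(toy_df))

# Keep only the columns at or below the cutoff
kept_df <- toy_df[, na_counts <= na_cutoff, drop = FALSE]

print(na_counts)
print(names(kept_df))
```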

class_df %>% select(-c(age,gender, outcome)) %>% summary %>% kbl %>% kable_paper("hover") %>% scroll_box(width = "100%", height = "300px")
hemoglobin Serum chloride Prothrombin time procalcitonin eosinophils(%) Alkaline phosphatase albumin basophil(%) Total bilirubin Platelet count monocytes(%) indirect bilirubin Red blood cell distribution width neutrophils(%) total protein Prothrombin activity mean corpuscular volume hematocrit White blood cell count mean corpuscular hemoglobin concentration fibrinogen Urea lymphocyte count Red blood cell count Eosinophil count Corrected calcium Serum potassium glucose neutrophils count Direct bilirubin Mean platelet volume RBC distribution width SD Thrombin time (%)lymphocyte D-D dimer Total cholesterol aspartate aminotransferase Uric acid HCO3- calcium Lactate dehydrogenase platelet large cell ratio monocytes count PLT distribution width globulin Gamma glutamyl transpeptidase International standard ratio basophil count(#) mean corpuscular hemoglobin Activation of partial thromboplastin time hs_CRP serum sodium thrombocytocrit glutamic-pyruvic transaminase eGFR creatinine
Min. : 6.4 Min. : 77.7 Min. :11.50 Min. : 0.020 Min. :0.0000 Min. : 17.00 Min. :13.60 Min. :0.0000 Min. : 2.80 Min. : -1.0 Min. : 0.600 Min. : 0.100 Min. :10.60 Min. : 1.90 Min. :31.80 Min. : 7.00 Min. : 62.30 Min. :15.60 Min. : 0.710 Min. :286.0 Min. :0.50 Min. : 1.70 Min. : 0.050 Min. : 0.100 Min. :0.00000 Min. :1.650 Min. :2.760 Min. : 1.000 Min. : 0.320 Min. : 1.600 Min. : 8.50 Min. : 31.30 Min. : 13.00 Min. : 0.30 Min. : 0.210 Min. :0.100 Min. : 6.00 Min. : 52.0 Min. : 6.30 Min. :1.170 Min. : 110.0 Min. :11.20 Min. : 0.0100 Min. : 8.10 Min. :10.10 Min. : 7.00 Min. : 0.840 Min. :0.00000 Min. :20.80 Min. : 21.80 Min. : 0.10 Min. :121.1 Min. :0.010 Min. : 5.00 Min. : 2.00 Min. : 14.0
1st Qu.:112.0 1st Qu.: 99.6 1st Qu.:13.40 1st Qu.: 0.030 1st Qu.:0.0000 1st Qu.: 54.00 1st Qu.:28.20 1st Qu.:0.1000 1st Qu.: 7.20 1st Qu.:113.0 1st Qu.: 2.950 1st Qu.: 3.725 1st Qu.:12.00 1st Qu.:61.85 1st Qu.:61.20 1st Qu.: 67.00 1st Qu.: 86.95 1st Qu.:33.00 1st Qu.: 5.115 1st Qu.:332.0 1st Qu.:3.18 1st Qu.: 3.80 1st Qu.: 0.520 1st Qu.: 3.550 1st Qu.:0.00000 1st Qu.:2.260 1st Qu.:4.030 1st Qu.: 5.120 1st Qu.: 3.110 1st Qu.: 3.100 1st Qu.:10.10 1st Qu.: 38.80 1st Qu.: 15.60 1st Qu.: 4.15 1st Qu.: 0.490 1st Qu.:2.950 1st Qu.: 19.00 1st Qu.: 198.5 1st Qu.:20.90 1st Qu.:1.990 1st Qu.: 198.0 1st Qu.:25.30 1st Qu.: 0.3100 1st Qu.:11.00 1st Qu.:28.95 1st Qu.: 21.00 1st Qu.: 1.015 1st Qu.:0.01000 1st Qu.:29.70 1st Qu.: 35.10 1st Qu.: 2.00 1st Qu.:138.3 1st Qu.:0.140 1st Qu.: 17.00 1st Qu.: 66.75 1st Qu.: 58.0
Median :125.0 Median :102.3 Median :14.30 Median : 0.095 Median :0.2000 Median : 71.00 Median :33.20 Median :0.2000 Median : 10.60 Median :190.0 Median : 6.200 Median : 5.300 Median :12.70 Median :77.80 Median :66.00 Median : 86.00 Median : 90.40 Median :36.30 Median : 7.930 Median :342.0 Median :4.22 Median : 5.40 Median : 0.990 Median : 4.100 Median :0.02000 Median :2.370 Median :4.430 Median : 6.540 Median : 5.390 Median : 4.800 Median :10.80 Median : 41.20 Median : 16.50 Median :14.20 Median : 1.330 Median :3.720 Median : 25.00 Median : 260.0 Median :23.90 Median :2.110 Median : 274.0 Median :30.90 Median : 0.4300 Median :12.50 Median :32.40 Median : 33.00 Median : 1.100 Median :0.02000 Median :30.90 Median : 38.90 Median : 26.50 Median :140.7 Median :0.210 Median : 26.00 Median : 89.40 Median : 74.0
Mean :124.4 Mean :103.3 Mean :16.05 Mean : 1.098 Mean :0.8994 Mean : 85.68 Mean :32.65 Mean :0.2631 Mean : 16.54 Mean :192.9 Mean : 6.519 Mean : 6.771 Mean :13.21 Mean :75.47 Mean :65.25 Mean : 81.22 Mean : 90.62 Mean :36.58 Mean : 18.974 Mean :342.1 Mean :4.30 Mean : 9.85 Mean : 1.163 Mean : 8.326 Mean :0.05369 Mean :2.347 Mean :4.501 Mean : 8.525 Mean : 8.017 Mean : 9.788 Mean :10.98 Mean : 42.83 Mean : 17.72 Mean :16.84 Mean : 6.262 Mean :3.745 Mean : 54.35 Mean : 295.4 Mean :23.20 Mean :2.095 Mean : 477.3 Mean :32.24 Mean : 0.5964 Mean :13.24 Mean :32.56 Mean : 49.53 Mean : 1.299 Mean :0.02135 Mean :31.01 Mean : 41.27 Mean : 64.98 Mean :141.8 Mean :0.213 Mean : 42.76 Mean : 81.95 Mean : 117.9
3rd Qu.:138.0 3rd Qu.:105.6 3rd Qu.:16.30 3rd Qu.: 0.450 3rd Qu.:1.5000 3rd Qu.: 98.00 3rd Qu.:37.60 3rd Qu.:0.4000 3rd Qu.: 16.15 3rd Qu.:257.0 3rd Qu.: 8.900 3rd Qu.: 7.900 3rd Qu.:13.80 3rd Qu.:91.95 3rd Qu.:70.40 3rd Qu.: 98.00 3rd Qu.: 94.25 3rd Qu.:40.15 3rd Qu.: 13.170 3rd Qu.:349.0 3rd Qu.:5.41 3rd Qu.:11.50 3rd Qu.: 1.540 3rd Qu.: 4.640 3rd Qu.:0.09000 3rd Qu.:2.450 3rd Qu.:4.820 3rd Qu.: 9.915 3rd Qu.:11.275 3rd Qu.: 7.450 3rd Qu.:11.60 3rd Qu.: 45.30 3rd Qu.: 17.90 3rd Qu.:27.50 3rd Qu.:12.050 3rd Qu.:4.370 3rd Qu.: 41.00 3rd Qu.: 347.9 3rd Qu.:26.35 3rd Qu.:2.220 3rd Qu.: 618.5 3rd Qu.:37.80 3rd Qu.: 0.6100 3rd Qu.:14.50 3rd Qu.:35.75 3rd Qu.: 55.00 3rd Qu.: 1.305 3rd Qu.:0.03000 3rd Qu.:32.20 3rd Qu.: 44.20 3rd Qu.: 99.12 3rd Qu.:143.3 3rd Qu.:0.280 3rd Qu.: 42.00 3rd Qu.:105.00 3rd Qu.: 97.0
Max. :178.0 Max. :140.4 Max. :92.10 Max. :57.170 Max. :8.6000 Max. :620.00 Max. :47.60 Max. :1.7000 Max. :295.40 Max. :554.0 Max. :53.000 Max. :59.700 Max. :27.10 Max. :98.90 Max. :83.40 Max. :142.00 Max. :117.60 Max. :52.30 Max. :1726.600 Max. :488.0 Max. :8.95 Max. :68.40 Max. :33.690 Max. :749.500 Max. :0.46000 Max. :2.790 Max. :9.860 Max. :38.820 Max. :32.220 Max. :242.900 Max. :15.00 Max. :113.30 Max. :144.90 Max. :48.50 Max. :21.000 Max. :7.300 Max. :1858.00 Max. :1176.0 Max. :33.80 Max. :2.600 Max. :1867.0 Max. :62.20 Max. :39.9200 Max. :25.30 Max. :49.20 Max. :732.00 Max. :11.570 Max. :0.12000 Max. :50.80 Max. :106.40 Max. :320.00 Max. :179.5 Max. :0.510 Max. :1508.00 Max. :206.90 Max. :1497.0
NA’s :1 NA’s :3 NA’s :5 NA’s :44 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :2 NA’s :7 NA’s :1 NA’s :1 NA’s :5 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :59 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :4 NA’s :3 NA’s :5 NA’s :1 NA’s :1 NA’s :11 NA’s :7 NA’s :59 NA’s :1 NA’s :15 NA’s :1 NA’s :1 NA’s :1 NA’s :1 NA’s :3 NA’s :1 NA’s :11 NA’s :1 NA’s :11 NA’s :1 NA’s :1 NA’s :5 NA’s :1 NA’s :1 NA’s :59 NA’s :4 NA’s :3 NA’s :11 NA’s :1 NA’s :1 NA’s :1

The remaining missing values are replaced with the median of the respective column, because the distributions are heavily skewed. The summary of the clean dataset, with no missing values, is presented below.

class_df <- class_df %>% na_mean(option = "median")
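To illustrate why the median is preferred for skewed columns, here is a small base R sketch of median imputation on a single toy vector. The values in `x` are made up for illustration; the sketch mimics the idea of the imputation step rather than reproducing the package call used above.

```r
# Toy skewed column with missing values (hypothetical numbers)
x <- c(1, 2, 2, 3, NA, 4, 500, NA)

# The median of the observed values is barely affected by the outlier 500,
# while the mean is pulled far away from the bulk of the data
med <- median(x, na.rm = TRUE)
avg <- mean(x, na.rm = TRUE)

# Replace each NA with the column median
x_imputed <- ifelse(is.na(x), med, x)

print(med)
print(avg)
print(x_imputed)
```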

class_df %>% select(-c(age,gender, outcome)) %>% summary %>% kbl %>% kable_paper("hover") %>% scroll_box(width = "100%", height = "300px")
hemoglobin Serum chloride Prothrombin time procalcitonin eosinophils(%) Alkaline phosphatase albumin basophil(%) Total bilirubin Platelet count monocytes(%) indirect bilirubin Red blood cell distribution width neutrophils(%) total protein Prothrombin activity mean corpuscular volume hematocrit White blood cell count mean corpuscular hemoglobin concentration fibrinogen Urea lymphocyte count Red blood cell count Eosinophil count Corrected calcium Serum potassium glucose neutrophils count Direct bilirubin Mean platelet volume RBC distribution width SD Thrombin time (%)lymphocyte D-D dimer Total cholesterol aspartate aminotransferase Uric acid HCO3- calcium Lactate dehydrogenase platelet large cell ratio monocytes count PLT distribution width globulin Gamma glutamyl transpeptidase International standard ratio basophil count(#) mean corpuscular hemoglobin Activation of partial thromboplastin time hs_CRP serum sodium thrombocytocrit glutamic-pyruvic transaminase eGFR creatinine
Min. : 6.4 Min. : 77.7 Min. :11.50 Min. : 0.0200 Min. :0.0000 Min. : 17.00 Min. :13.60 Min. :0.0000 Min. : 2.80 Min. : -1.0 Min. : 0.600 Min. : 0.100 Min. :10.6 Min. : 1.90 Min. :31.80 Min. : 7.00 Min. : 62.30 Min. :15.60 Min. : 0.710 Min. :286.0 Min. :0.500 Min. : 1.700 Min. : 0.050 Min. : 0.100 Min. :0.0000 Min. :1.650 Min. :2.760 Min. : 1.000 Min. : 0.320 Min. : 1.600 Min. : 8.50 Min. : 31.3 Min. : 13.00 Min. : 0.300 Min. : 0.210 Min. :0.100 Min. : 6.00 Min. : 52.0 Min. : 6.30 Min. :1.170 Min. : 110.0 Min. :11.2 Min. : 0.0100 Min. : 8.10 Min. :10.10 Min. : 7.00 Min. : 0.840 Min. :0.00000 Min. :20.80 Min. : 21.80 Min. : 0.100 Min. :121.1 Min. :0.0100 Min. : 5.00 Min. : 2.00 Min. : 14.0
1st Qu.:112.0 1st Qu.: 99.6 1st Qu.:13.40 1st Qu.: 0.0400 1st Qu.:0.0000 1st Qu.: 54.00 1st Qu.:28.20 1st Qu.:0.1000 1st Qu.: 7.20 1st Qu.:113.0 1st Qu.: 2.975 1st Qu.: 3.775 1st Qu.:12.1 1st Qu.:61.88 1st Qu.:61.20 1st Qu.: 67.00 1st Qu.: 86.97 1st Qu.:33.00 1st Qu.: 5.117 1st Qu.:332.0 1st Qu.:3.417 1st Qu.: 3.800 1st Qu.: 0.520 1st Qu.: 3.550 1st Qu.:0.0000 1st Qu.:2.260 1st Qu.:4.037 1st Qu.: 5.143 1st Qu.: 3.115 1st Qu.: 3.100 1st Qu.:10.10 1st Qu.: 38.8 1st Qu.: 15.80 1st Qu.: 4.175 1st Qu.: 0.510 1st Qu.:2.950 1st Qu.: 19.00 1st Qu.: 198.8 1st Qu.:20.90 1st Qu.:1.990 1st Qu.: 199.0 1st Qu.:25.4 1st Qu.: 0.3100 1st Qu.:11.07 1st Qu.:28.98 1st Qu.: 21.00 1st Qu.: 1.020 1st Qu.:0.01000 1st Qu.:29.70 1st Qu.: 36.08 1st Qu.: 2.075 1st Qu.:138.3 1st Qu.:0.1400 1st Qu.: 17.00 1st Qu.: 66.78 1st Qu.: 58.0
Median :125.0 Median :102.3 Median :14.30 Median : 0.0950 Median :0.2000 Median : 71.00 Median :33.20 Median :0.2000 Median : 10.60 Median :190.0 Median : 6.200 Median : 5.300 Median :12.7 Median :77.80 Median :66.00 Median : 86.00 Median : 90.40 Median :36.30 Median : 7.930 Median :342.0 Median :4.220 Median : 5.400 Median : 0.990 Median : 4.100 Median :0.0200 Median :2.370 Median :4.430 Median : 6.540 Median : 5.390 Median : 4.800 Median :10.80 Median : 41.2 Median : 16.50 Median :14.200 Median : 1.330 Median :3.720 Median : 25.00 Median : 260.0 Median :23.90 Median :2.110 Median : 274.0 Median :30.9 Median : 0.4300 Median :12.50 Median :32.40 Median : 33.00 Median : 1.100 Median :0.02000 Median :30.90 Median : 38.90 Median : 26.500 Median :140.7 Median :0.2100 Median : 26.00 Median : 89.40 Median : 74.0
Mean :124.4 Mean :103.3 Mean :16.02 Mean : 0.9742 Mean :0.8975 Mean : 85.64 Mean :32.66 Mean :0.2629 Mean : 16.52 Mean :192.9 Mean : 6.518 Mean : 6.762 Mean :13.2 Mean :75.48 Mean :65.25 Mean : 81.28 Mean : 90.62 Mean :36.58 Mean : 18.943 Mean :342.1 Mean :4.286 Mean : 9.838 Mean : 1.162 Mean : 8.314 Mean :0.0536 Mean :2.347 Mean :4.500 Mean : 8.497 Mean : 8.009 Mean : 9.774 Mean :10.97 Mean : 42.8 Mean : 17.52 Mean :16.837 Mean : 6.054 Mean :3.745 Mean : 54.27 Mean : 295.3 Mean :23.20 Mean :2.095 Mean : 476.7 Mean :32.2 Mean : 0.5959 Mean :13.21 Mean :32.56 Mean : 49.49 Mean : 1.296 Mean :0.02135 Mean :31.01 Mean : 40.88 Mean : 64.543 Mean :141.8 Mean :0.2129 Mean : 42.72 Mean : 81.97 Mean : 117.7
3rd Qu.:138.0 3rd Qu.:105.5 3rd Qu.:16.30 3rd Qu.: 0.3525 3rd Qu.:1.5000 3rd Qu.: 98.00 3rd Qu.:37.60 3rd Qu.:0.4000 3rd Qu.: 16.12 3rd Qu.:257.0 3rd Qu.: 8.900 3rd Qu.: 7.900 3rd Qu.:13.8 3rd Qu.:91.92 3rd Qu.:70.40 3rd Qu.: 97.25 3rd Qu.: 94.22 3rd Qu.:40.12 3rd Qu.: 13.155 3rd Qu.:349.0 3rd Qu.:5.145 3rd Qu.:11.500 3rd Qu.: 1.540 3rd Qu.: 4.635 3rd Qu.:0.0900 3rd Qu.:2.450 3rd Qu.:4.812 3rd Qu.: 9.675 3rd Qu.:11.242 3rd Qu.: 7.425 3rd Qu.:11.60 3rd Qu.: 45.2 3rd Qu.: 17.50 3rd Qu.:27.500 3rd Qu.:10.515 3rd Qu.:4.370 3rd Qu.: 41.00 3rd Qu.: 347.4 3rd Qu.:26.32 3rd Qu.:2.220 3rd Qu.: 617.8 3rd Qu.:37.6 3rd Qu.: 0.6100 3rd Qu.:14.40 3rd Qu.:35.73 3rd Qu.: 55.00 3rd Qu.: 1.300 3rd Qu.:0.03000 3rd Qu.:32.20 3rd Qu.: 42.83 3rd Qu.: 98.950 3rd Qu.:143.3 3rd Qu.:0.2700 3rd Qu.: 42.00 3rd Qu.:105.00 3rd Qu.: 97.0
Max. :178.0 Max. :140.4 Max. :92.10 Max. :57.1700 Max. :8.6000 Max. :620.00 Max. :47.60 Max. :1.7000 Max. :295.40 Max. :554.0 Max. :53.000 Max. :59.700 Max. :27.1 Max. :98.90 Max. :83.40 Max. :142.00 Max. :117.60 Max. :52.30 Max. :1726.600 Max. :488.0 Max. :8.950 Max. :68.400 Max. :33.690 Max. :749.500 Max. :0.4600 Max. :2.790 Max. :9.860 Max. :38.820 Max. :32.220 Max. :242.900 Max. :15.00 Max. :113.3 Max. :144.90 Max. :48.500 Max. :21.000 Max. :7.300 Max. :1858.00 Max. :1176.0 Max. :33.80 Max. :2.600 Max. :1867.0 Max. :62.2 Max. :39.9200 Max. :25.30 Max. :49.20 Max. :732.00 Max. :11.570 Max. :0.12000 Max. :50.80 Max. :106.40 Max. :320.000 Max. :179.5 Max. :0.5100 Max. :1508.00 Max. :206.90 Max. :1497.0

Dataset shuffle and split

The preprocessed data is split into two datasets: training and testing.

The patients are grouped by outcome: those who survived come first in the dataset, followed by those who died. The dataset is therefore shuffled, and the split is checked to make sure that the training and testing sets have similar outcome class distributions.

To ensure the repeatability of experiments, the seed is set to 23.

set.seed(23)
rows <- sample(nrow(class_df))
class_df <- class_df[rows,]


set.seed(23)
inTraining <- createDataPartition(y = class_df$outcome, p=.70, list=FALSE)
training <- class_df[inTraining,]
testing <- class_df[-inTraining,]
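The partitioning function used above samples within each outcome class, which is what keeps the class proportions similar in both subsets. A minimal base R sketch of such a stratified 70/30 split is shown below; the toy `outcome` vector and its class counts are assumptions for illustration only.

```r
set.seed(23)

# Toy outcome vector (hypothetical class counts)
outcome <- factor(c(rep("Died", 40), rep("Survived", 60)))

# Stratified split: sample 70% of the row indices within each class
train_idx <- unlist(
  lapply(split(seq_along(outcome), outcome), function(idx) {
    idx[sample.int(length(idx), size = round(0.7 * length(idx)))]
  }),
  use.names = FALSE
)

train_outcome <- outcome[train_idx]
test_outcome  <- outcome[-train_idx]

# The class proportions are the same in both subsets
print(prop.table(table(train_outcome)))
print(prop.table(table(test_outcome)))
```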

Training set summary

training %>% select(gender, outcome) %>% tbl_summary(by = outcome) %>%  as_kable()  %>% kable_paper("hover")
Characteristic Died, N = 115 Survived, N = 136
gender
female 36 (31%) 77 (57%)
male 79 (69%) 59 (43%)

Testing set summary

testing %>% select(gender, outcome) %>% tbl_summary(by = outcome) %>%  as_kable()  %>% kable_paper("hover")
Characteristic Died, N = 48 Survived, N = 57
gender
female 9 (19%) 25 (44%)
male 39 (81%) 32 (56%)
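The similarity of the class distributions can be checked directly from the counts in the two summaries above. The short sketch below only redoes that arithmetic; the variable names are illustrative.

```r
# Class counts copied from the training and testing summaries above
train_counts <- c(Died = 115, Survived = 136)
test_counts  <- c(Died = 48,  Survived = 57)

# Share of each outcome class within each subset
train_prop <- train_counts / sum(train_counts)
test_prop  <- test_counts / sum(test_counts)

print(round(train_prop, 3))
print(round(test_prop, 3))
```

In both subsets roughly 46% of the patients died, so the split preserves the outcome distribution.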

Train control

Repeated 2-fold cross-validation is used for the learning process: the 2-fold split is repeated 5 times.

set.seed(23)
ctrl <- trainControl(
    method = "repeatedcv",
    number = 2,
    repeats = 5,
    classProbs = TRUE)
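To make the resampling scheme concrete, the base R sketch below builds the fold assignments for 2 folds repeated 5 times on 251 rows (the size of the training set). It is a simplified illustration: the fold assignment here is purely random, whereas the actual resampling may additionally balance the outcome classes across folds.

```r
set.seed(23)

n <- 251        # training set size (115 + 136 patients)
k <- 2          # number of folds
repeats <- 5    # number of repetitions

# For each repetition, randomly assign every row to one of the k folds
fold_ids <- lapply(seq_len(repeats), function(r) {
  sample(rep(seq_len(k), length.out = n))
})

# Each fold is held out once; the model is fitted on the remaining rows
train_sizes <- unlist(lapply(fold_ids, function(f) {
  sapply(seq_len(k), function(j) sum(f != j))
}))

print(length(train_sizes))
print(sort(unique(train_sizes)))
```

This gives 10 model fits per candidate parameter value, each on 125 or 126 rows.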

To measure the performance of the model, three measures are considered: accuracy, the ROC curve, and AUC.

Random Forest

Model training

The Random Forest model is trained with default parameters, except that the number of trees in the forest is set to 15 and mtry is tuned over the values 10 to 20. Since no metric is specified in the call, accuracy is used to select the final model.

rfGrid <- expand.grid(mtry = 10:20)

set.seed(23)
rf_fit <- train(outcome ~ ., 
                data = training, 
                method = "rf",
                preProc = c("center", "scale"),
                trControl = ctrl,
                tuneGrid = rfGrid,
                ntree = 15)

rf_fit
## Random Forest 
## 
## 251 samples
##  58 predictor
##   2 classes: 'Died', 'Survived' 
## 
## Pre-processing: centered (58), scaled (58) 
## Resampling: Cross-Validated (2 fold, repeated 5 times) 
## Summary of sample sizes: 126, 125, 125, 126, 125, 126, ... 
## Resampling results across tuning parameters:
## 
##   mtry  Accuracy   Kappa    
##   10    0.9553587  0.9101052
##   11    0.9593079  0.9182601
##   12    0.9593460  0.9180843
##   13    0.9577079  0.9149460
##   14    0.9593143  0.9180618
##   15    0.9601206  0.9198597
##   16    0.9584952  0.9168637
##   17    0.9601206  0.9198335
##   18    0.9625206  0.9245983
##   19    0.9633270  0.9262290
##   20    0.9649016  0.9294723
## 
## Accuracy was used to select the optimal model using the largest value.
## The final value used for the model was mtry = 20.

Prediction

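The model performs very well, with 97% accuracy on the testing set. Three patients are misclassified: all three survived but were predicted to die (false positives, since Died is the positive class). This type of error is the less harmful one, as no patient who died was predicted to survive.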

rf_classes <- predict(rf_fit, newdata = testing)
rf_classes_prob <- predict(rf_fit, newdata = testing, type = "prob")
caret::confusionMatrix(data = rf_classes, testing$outcome)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction Died Survived
##   Died       48        3
##   Survived    0       54
##                                           
##                Accuracy : 0.9714          
##                  95% CI : (0.9188, 0.9941)
##     No Information Rate : 0.5429          
##     P-Value [Acc > NIR] : <2e-16          
##                                           
##                   Kappa : 0.9427          
##                                           
##  Mcnemar's Test P-Value : 0.2482          
##                                           
##             Sensitivity : 1.0000          
##             Specificity : 0.9474          
##          Pos Pred Value : 0.9412          
##          Neg Pred Value : 1.0000          
##              Prevalence : 0.4571          
##          Detection Rate : 0.4571          
##    Detection Prevalence : 0.4857          
##       Balanced Accuracy : 0.9737          
##                                           
##        'Positive' Class : Died            
## 
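The headline statistics can be recomputed by hand from the four cells of the confusion matrix. The sketch below does this in base R, using the counts printed above; the variable names are illustrative.

```r
# Counts copied from the confusion matrix above ("Died" is the positive class)
tp <- 48  # predicted Died, actually Died
fp <- 3   # predicted Died, actually Survived
fn <- 0   # predicted Survived, actually Died
tn <- 54  # predicted Survived, actually Survived

accuracy    <- (tp + tn) / (tp + fp + fn + tn)
sensitivity <- tp / (tp + fn)   # share of deaths that were detected
specificity <- tn / (tn + fp)   # share of survivors that were detected
ppv         <- tp / (tp + fp)   # share of "Died" predictions that were right

print(round(c(accuracy = accuracy, sensitivity = sensitivity,
              specificity = specificity, ppv = ppv), 4))
```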

ROC and AUC

The ROC curve rises steeply towards the top-left corner, which means that the model separates the two classes very well. The AUC is also very high, close to 1.

rf_ROC <- roc(response = testing$outcome, 
              predictor = rf_classes_prob[, "Died"],
              levels = rev(levels(testing$outcome)),
              plot = TRUE,
              auc = TRUE,
              print.auc = TRUE)

rf_ROC
## 
## Call:
## roc.default(response = testing$outcome, predictor = rf_classes_prob[,     "Died"], levels = rev(levels(testing$outcome)), auc = TRUE,     plot = TRUE, print.auc = TRUE)
## 
## Data: rf_classes_prob[, "Died"] in 57 controls (testing$outcome Survived) < 48 cases (testing$outcome Died).
## Area under the curve: 0.9987
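The AUC can be read as the probability that a randomly chosen patient who died receives a higher predicted probability of death than a randomly chosen survivor. The base R sketch below computes it that way on made-up scores; the vectors `cases` and `controls` are hypothetical and unrelated to the real predictions.

```r
# Toy predicted probabilities of "Died" (hypothetical values)
cases    <- c(0.9, 0.8, 0.7, 0.4)   # patients who died
controls <- c(0.6, 0.3, 0.2, 0.1)   # patients who survived

# Compare every case with every control:
# 1 if the case scores higher, 0.5 for a tie, 0 otherwise
pairs <- outer(cases, controls, function(a, b) (a > b) + 0.5 * (a == b))

# The AUC is the average over all case/control pairs
auc_value <- mean(pairs)
print(auc_value)
```

Here 15 of the 16 pairs are ordered correctly, so the toy AUC is 15/16.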

Feature importance

The variable importances presented below show that the model relies on only a few decisive variables. Three of them, LDH, lymphocyte percentage, and hs-CRP, are ranked as the most important for predicting the mortality of COVID-19 patients, the same as in the article An interpretable mortality prediction model for COVID-19 patients.

importance <- varImp(rf_fit)

importance
## rf variable importance
## 
##   only 20 most important variables shown (out of 58)
## 
##                                             Overall
## `Lactate dehydrogenase`                     100.000
## `(%)lymphocyte`                              33.492
## hs_CRP                                       27.732
## `lymphocyte count`                           15.807
## `neutrophils(%)`                             13.289
## `Prothrombin time`                           12.755
## `International standard ratio`               10.094
## `Platelet count`                              4.623
## `aspartate aminotransferase`                  3.872
## procalcitonin                                 3.417
## `HCO3-`                                       2.343
## `Total cholesterol`                           2.080
## `platelet large cell ratio`                   2.016
## thrombocytocrit                               2.014
## age                                           2.013
## `glutamic-pyruvic transaminase`               1.648
## `Activation of partial thromboplastin time`   1.637
## eGFR                                          1.266
## glucose                                       1.240
## `eosinophils(%)`                              1.212

Further development

As further development, only the most important variables (importance > 5) could be used to retrain the model, with the aim of reaching 100% classification accuracy. The model could additionally be evaluated with learning curves to detect overfitting.
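As a starting point for that follow-up, the variables passing the threshold can be picked out of the importance scores. The sketch below hard-codes the ten highest scores from the importance output above; in a real run they would be taken from the fitted model instead, and the name `importance_scores` is illustrative.

```r
# Scaled importance scores copied from the output above (top 10 only)
importance_scores <- c(
  "Lactate dehydrogenase"        = 100.000,
  "(%)lymphocyte"                = 33.492,
  "hs_CRP"                       = 27.732,
  "lymphocyte count"             = 15.807,
  "neutrophils(%)"               = 13.289,
  "Prothrombin time"             = 12.755,
  "International standard ratio" = 10.094,
  "Platelet count"               = 4.623,
  "aspartate aminotransferase"   = 3.872,
  "procalcitonin"                = 3.417
)

# Keep the variables whose importance exceeds the threshold
selected <- names(importance_scores)[importance_scores > 5]
print(selected)
```

Seven variables pass the threshold, so the retrained model would use 7 predictors instead of 58.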